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Abstract 

Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the 
CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations 
for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of 
energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10 8 
MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the 
slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single-floating point precision may 
result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation 
shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with 
a parallelised MD simulation using 64 distributed cores. 
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1. Introduction 



In recent years, the computational power of graphics pro- 
cessing units (GPUs) has increased rapidly: the theoretical peak 
Q performance for single precision floating-point operations on 
— 'an amateur's GPU^jreaches nearly lTflop/s. Compared to a 
single core of a conventional processor, this gives rise to an ex- 
pected performance jump of one or two orders of magnitude for 
many demanding computational problems, fueling the desire to 
exploit graphics processors for scientific applications. While 
conventional high-performance computing (HPC) depends on 
expensive central computing clusters, shared amongst many re- 
searchers, GPU computing makes local clusters acting as small 
HPC facilities conceivable for large-scale simulations at a frac- 
tion of the cost. 

A modern GPU works as a streaming processor implement- 
ing the single-instruction-multiple-threads (SIMT) model. One 
device contains several hundred scalar processor cores execut- 
ing a single instruction or a small set of divergent instructions 
in parallel in thousands of threads on a large data set. Parallel, 
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coalesced access to the onboard device memory via a memory 
interface of up to 512 bits provides a remarkably high memory 
bandwidth. 

Molecular dynamics (MD) simulations — a widespread tool 
for particle-based simulations in biological physics, chemistry, 
and material science — are ideally applicable to the GPU due to 
their inherent parallelism. The release of NVIDIA's compute 
unified device architecture (CUDA) as a convenient means of 
GPU programming has triggered a lot of activity in exploiting 
GPUs for scientific applications. Several MD implementations 
using GPUs have demonstrated significant speedups, with per- 
formance measurements based on relatively short test runs Q]- 
13. A critical review of current attempts to exploit GPUs in 
MD simulations may be found in Ref. 6, in essence, published 
results are not yet available, showing the "nascent nature" of 
the field. As a notable exception, complex MD simulations tar- 
geting at protein folding were accelerated by GPUs allowing 
for the trajectories to reach into the millisecond regime 018]. 
In the realm of physics, we are only aware of a Monte-Carlo 
study of the critical behaviour of the Ising model that was per- 
formed on the GPU, reporting speedups between 35 and 60 for 
the mostly integer arithmetic-based algorithm [9]; a multi-GPU 
approach to this problem shows a promising scalability ITOl . 

In this article, we describe MD simulations that are fully im- 
plemented on the GPU and that faithfully reproduce the glassy 
dynamics of a supercooled binary mixture. If a glass-forming 
liquid is cooled or compressed, the structural relaxation criti- 
cally slows down by several orders of magnitude and a rapidly 
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growing time scale emerges close to the glass transition line IfTTII . 
Small changes in the temperature may already have drastic ef- 
fects on the dynamics, and a numerical study thus requires ex- 
cellent long-time stability with respect to energy conservation 
over long simulation runs of 10 8 MD integration steps and more. 
The single precision floating-point arithmetic provided by re- 
cent GPUs turns out to be a major obstacle to this goal. But 
once this limitation is overcome, general-purpose computing 
on GPUs provides a useful tool to address current questions 
of the glass transition with minimal allocation of hardware and 
reasonable computing time. 

The substantial enhancement of computing resources due to 
GPU computing facilitates the investigation of very large sys- 
tems, which is desirable for studies of phenomena associated 
with a divergent length scale. Only recently, evidence for a 
divergent static correlation length in supercooled liquids was 
found in large-scale simulations of up to 64,000 particles Ifl2l 
and 80,000 particles [131, respectively. 

The article is organised as follows. We describe a high- 
precision implementation for the GPU in Section [2] followed 
by performance benchmarks in Section [3] A detailed analy- 
sis of the long-time stability regarding momentum and energy 
conservation is given in Section |4] Section [5] demonstrates the 
impact of numerical accuracy on the simulation results for the 
glassy dynamics of a binary mixture of soft spheres. A conclu- 
sion is given in Section [6] 

2. Implementation for the GPU 

2.1. Architecture of the GPU hardware 

An MD implementation for the GPU needs to be adapted 
to its vastly different architecture. A modern GPU consists of 
hundreds of scalar cores, which map to execution threads in the 
CUDA architecture and require fine-grained parallelisation of 
the algorithm. The scalar cores are divided into multiprocessors — 
units of 8 processors for the NVIDIA G200 series GPUs — 
capable of executing an atomic warp of 32 threads in four clock 
cycles. Each multiprocessor is equipped with a total of 16,384 
32-bit registers, and a fast, tiny shared memory of 16kB; all 
multiprocessors access a large global device memory of 1 to 
4 GB, which is two orders of magnitude slower than local reg- 
isters. To hide latencies when accessing the global device mem- 
ory, a scheduler concurrently executes multiple warps on a mul- 
tiprocessor. The equivalent of a multiprocessor in the CUDA 
architecture is a block of up to 512 threads, and the only means 
of communication and synchronisation is within a block. For 
the execution of complex algorithms, the number of threads per 
block has to be lowered due to the limited number of regis- 
ters available per multiprocessor. Global device memory en- 
forces a strict memory access pattern, where threads have to ad- 
dress memory in a linear coalesced order. In contrast to shared 
memory access, random read and write access to global device 
memory entails a performance penalty of an order of magni- 
tude. This impairs GPU acceleration of many common com- 
putational primitives such as sorting algorithms. As a partial 
remedy a texture cache of up to 8 kB per multiprocessor, which 



stems from the graphics heritage of the GPU, enables read-only 
random access to a small window of global device memory. A 
global cache of 64 kB of constant memory provides access to 
constant data at speeds comparable to register access. 

2.2. The MD integration step 

A soft-sphere molecular dynamics (MD) simulation solves 
Newton's equations in discretised time for N classical particles 
interacting via a C 2 -continuous, short-ranged potential. Ev- 
ery MD step, the force on each particle exerted by all inter- 
acting particles is calculated, and the particles are propagated 
by a symplectic integrator, the velocity- Verlet algorithm |14|. 
A naive implementation based on a doubly nested force loop 
would yield an algorithmic complexity of 0(N 2 ). For short- 
ranged interparticle forces however, a linear scaling of the per- 
formance with the number of particles can be achieved with 
Verlet neighbour lists. For each particle, a list of particles lo- 
cated within a radius r c , the cutoff radius of the potential, is kept 
in memory; the force algorithm then considers only particles in 
the neighbour list. A small "skin" of a fraction of r c is added to 
the neighbour sphere to reduce the necessity of rebuilding the 
neighbour lists to every 10 to 100 steps. Particle binning — the 
decomposition of the system into spatial cells — avoids a dou- 
bly nested loop in the neighbour list algorithm by limiting the 
search for neighbours of a particle to its own and the adjacent 
cells. Our implementation for the GPU combines and extends 
concepts described in detail in Refs.Q]and[2] 

Parallelisation of the velocity- Verlet algorithm for the GPU 
is straightforward and naturally respects coalesced memory ac- 
cess: each CUDA thread is assigned a single particle with the 
task of updating the velocities and coordinates. The imple- 
mented force and neighbour list algorithms resemble the ones 
proposed in Ref. |T| Specifically in the force algorithm, each 
thread reads the indices of a particle's neighbours in a coa- 
lesced manner, and coordinates are then fetched using the tex- 
ture cache, mitigating performance penalties due to random mem- 
ory access. 

For particle binning, we have implemented a cell list algo- 
rithm based on a parallel radix sort 03] [16]. Each particle is as- 
signed a one-dimensional integer cell index. An array contain- 
ing the particle numbers is sorted according to the cell indices, 
as proposed in the Particles example of the CUDA SDK ifTTl . 
A further pass determines the start index of each cell, followed 
by the assembly of fixed-size cell lists for the subsequent neigh- 
bour list algorithm. With this method we avoid the bottleneck of 
a host to device memory copy which is needed with the CPU- 
based cell list algorithm of Ref. 1 The radix sort algorithm 
performs worse on the GPU than on the CPU for small systems 
of considerably less than 10 4 particles, and further, particle bin- 
ning is only necessary every 10 to 100 steps. Thus, the overall 
performance of the MD step is best for systems of 10 5 and more 
particles. 

The neighbour list algorithm finally reads the particle in- 
dices from the fixed-size cell lists to gather coordinates and ve- 
locities of particles in neighbouring cells via texture fetches, 
temporarily stores them in shared memory, and builds a fixed- 
size neighbour list fl]. 
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Figure 1 : Hilbert's space-filling curve in three dimensions after 1 , 2 and 3 recursions. It provides the mapping between three-dimensional space and one-dimensional 
memory, which ensures texture locality of neighbouring particles. Particles at vertices with similar colour will be close in memory after the sort. 



A parallel reduction schem^jon the GPU is used to cal- 
culate properties which are needed every MD step such as the 
maximum absolute particle displacement and the potential en- 
ergy sum as well as less frequently evaluated properties such as 
temperature, centre-of-mass velocity, and the virial tensor. 

In the force algorithm, we make efficient use of the tex- 
ture cache by periodically sorting the particles in global device 
memory according to a three-dimensional space-filling Hilbert 
curve |Q~1[T8). Such a curve provides a mapping between 3D 
space and ID memory to optimally conserve spatial locality. 
In our implementation, the mapping is recursively generated 
on the GPU using 8 vertex rules Ifl9l as depicted in Fig. [T] 
based on a lattice spacing of <x with a maximum recursion depth 
of 10. The subsequent particle sort employs the radix sort algo- 
rithm lfT5"l[l6l to generate a permutation array; then read access 
to the texture cache and coalesced write access to global mem- 
ory are used to efficiently permute particle coordinates and ve- 
locities. 

2.3. Random number generator 

A modified Andersen thermostat for pre-equilibration cool- 
ing is realised by assigning Boltzmann-distributed velocities 
to all particles every (p.8i)~ x steps according to a fixed heat 
bath collision rate /j. Furthermore, the assigned velocities are 
rescaled by means of parallel reduction to exactly match the 
temperature of the heat bath. We have implemented the parallel 
rand48 random number generator, which may be trivially par- 
allelised by leap-frogging within the sequence 0. The linear 
recurrence x t+ j = Ajx t + Cj mod 2 48 uses a leapfrog multi- 
plier A T = a T mod 2 48 and leapfrog addend Cj — c £„=o a " 
mod 2 48 to jump within the sequence depending on the total 
number of threads T. As an improvement compared to Ref.|2] 
we seed the parallel generator using the parallel prefix sum al- 
gorithm^] which yields the initial state x t for each thread t by 
binary-tree summation and multiplication. Thus, initialisation 
times of many seconds are avoided for large systems. 

2.4. Double-single precision floating-point arithmetic 

It will be demonstrated below that numerical long-time sta- 
bility requires double precision arithmetic in critical parts of the 

4 See example "Reduction" in Ref. 1171 

5 See example "Scan" in Ref. 1171 



Algorithm 1 Algorithm for the addition of two floating-point 
numbers in double-single precision, based on the DSFUN90 
package ll25l . The subscripts and 1 refer to the high and low 
word of a double-single float, respectively. 

{Compute a + b using Knuth's trick.) 

to <— flo + bo 

e <— t - ao 

h <- ((b -e) + (ao - (t - e))) +a l +b 1 

{The result is to + h, after normalisation.) 
c <- e <- to + t\ 
c\ <- t\ - (e - t ) 



MD step. Currently prevalent GPUs of the NVIDIA GT200 se- 
ries feature only 1 double precision floating-point unit per every 
8 single precision units, which prohibits the use of native dou- 
ble precision in performance-critical algorithms. 

The limited native precision is overcome by algorithms which 
implement multi-precision arithmetic using two native floating- 
point registers [20, 21]. On hardware supporting the IEEE 754- 
1985 floating-point standard [22], proofs of numerical exact- 
ness have been given for multi-precision addition and multipli- 
cation 11231 . For the GPU, which does not fully comply with 
IEEE 754-1985 in terms of rounding and division, these proofs 
have been adapted for double-single precision arithmetic using 
two native single precision floats [24]. The double-single pre- 
cision algorithms for addition and multiplication with NVIDIA 
GPUs yield an effective precision of 44 bits. 

For IEEE 754-1985 compatible hardware, the DSFUN90 
package for Fortran [25] contains implementations of addition, 
subtraction, multiplication, division, and square root in double- 
single precision. We have ported these implementations from 
Fortran to CUDA, extending the functionality provided by the 
Mandelbrot example of the CUDA SDK IfTTl . As an example 
the addition algorithm in double-single precision is displayed 
in Table [T] To implement double-single multiplication on the 
GPU, care has to be taken to avoid fused multiply-add opera- 
tions, as the GPU does not round the result of the comprised 
multiplication, in violation of the floating-point standard. As a 
remedy, CUDA provides the intrinsic operations __fmuLrn and 
-JadcLrn. 
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Figure 2: Illustration of the blocking scheme for k = 4 blocks of size I = 5. 
Each square represents the system state at a given time; the numbers refer to the 
finest time resolution At. Dashed arrows indicate the filling of the higher block 
levels with lower resolution. Solid lines indicate the correlations calculated 
between the most left entry and all other entries of the same block. The thin 
squares to the left have already been processed and were discarded, while those 
to the right still have to be inserted into the blocks. 

In double-single precision, we have implemented the up- 
date of velocities and coordinates in the velocity- Verlet step 
as well as the summing over force contributions from neigh- 
bouring particles. These may be subject to accumulated sum- 
ming errors depending on the particular time step and potential, 
respectively. The evaluation of the force contributions itself, 
which accounts for a large fraction of the computational cost 
within the MD step, remains in single precision. 

2.5. Evaluation of time-correlation functions 

The dynamic properties of a molecular system are often 
quantified in terms of time-correlation functions [26|. For ob- 
servables A and B, one defines their correlation at different 
times t and s as 

C AB (t, s) = (A(t)B*(s)) 

1 r T 

= lim - drA(f + t) B*(s + r). (1) 
r ^°° T Jo 

In equilibrium, time-correlation functions are stationary and de- 
pend only on the difference t - s, thus Cab(?, s) = CabU - s). 
The most important class of time-correlation functions is com- 
prised by the a«fo-correlation functions Caa(0- If the system 
is ergodic, the average may be evaluated alternatively as an en- 
semble average over initial conditions, i.e., over independent 
simulation runs. For tagged-particle observables, the statistical 
error is decreased additionally by averaging over all particles of 
the same species. 

We have developed a blocking scheme which allows the on- 
line evaluation of time-correlation functions during the produc- 
tion run. The scheme was inspired by the "order-n algorithm" 
of Ref. [27J Complex relaxation processes often comprise sev- 
eral time scales and are usually discussed on a logarithmic time 
axis. The blocking scheme yields the correlation functions for 
fast and slow processes simultaneously and provides a time grid 
already suitable for a logarithmic representation. Introducing 



some short-time resolution Af, we assume that the state of the 
system is stored at times mAt for m = 0, 1, 2, ... , from which 
the observables A m = A(mAf) are calculated. For a long sim- 
ulation run over a time span of T = MAt, we approximate the 
integral in Eq. ([T]) by a sum, 

. M'-i 

C%> := C AA (mAt) * — £ A m+j A* (2) 

7=0 

where M' — M — m. The evaluation of Crjj for all m would 
require handling a huge number of M copies of the system state 
at different times (or of the derived observables at least) and 
would involve 0(M 2 ) floating-point operations. Instead, we ar- 
range the time grid in k blocks of size £, where the time reso- 
lution between subsequent blocks is increased by a factor of (, 
see Fig. [2] Within block «, correlations are calculated for time 
lags At„, ...,{{- Y)At n only, where At„ = ("At. The blocks are 
continuously filled during the simulation. Whenever a block is 
full, the first entry is correlated with all other entries and is then 
discarded. Thus, each block contains a section of the trajectory, 
moving forward as the simulation progresses. The memory re- 
quirement to handle a trajectory of length ( k At is merely k x I 
snapshots of the system state. 

3. Performance measurements 

A central argument for the use of GPUs in high-performance 
computing is their high theoretical peak performance compared 
to that of a single core of a conventional Opteron or Xeon CPU. 
We did extensive performance measurements of our MD im- 
plementation, which we compare first to our own serial refer- 
ence implementation for the host processor. The comparison 
between a massively parallel implementation and a serial one is 
somewhat unfair, in particular in view of the multi-core nature 
of current CPUs. Thus, we secondly provide a comparison with 
the LAMMPS package [28], being one of the established, par- 
allelised MD packages widely used in the physics community. 
The test includes the parallel use of multi-core CPUs within a 
single compute node as well as distributed computing over sev- 
eral nodes, which is more realistic for production use and allows 
for a more reasonable comparison between a conventional clus- 
ter and a single GPU; a similar approach was taken recently by 
Harvey et al. (8). LAMMPS offers some GPU acceleration as 
well which we have not used by purpose and which we explic- 
itly do not refer to. 

Our measurements were done on Lennard-Jones liquids of 
varying size and density with the conventional 12-6 interac- 
tion potential, V u (r;a,e) = 4s[(r/cr)~ 12 - (r/cr) -6 ], cut off at 
r c = 2.5cr; in LAMMPS we used the potential implementation 
termed Ij/cut. Units of length, mass, time, and energy are <x, m, 
^mo- 2 /s, and s as usual, and dimensionless quantities are indi- 
cated by an asterisk. We used a step size of 5t* = 0.001 and a 
neighbour list skin of width 0.5cr for the GPU and 0.3cr for the 
serial CPU and parallel LAMMPS benchmarks. To generate a 
homogeneous system, an initial fee lattice was equilibrated in 
the NVT ensemble at T* — k^T/s = 1.5 for 10 4 steps. Then, 
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Figure 3: Results of the performance benchmarks for a Lennard-Jones liquid of various densities as function of the number of particles JV; temperature was kept 
fixed at T* = 1.5. Large panels show the total time T per MD step, while insets test the expected linear scaling and display r/N. Panel a: average time required for a 
single MD step on the GPU (single precision); panel b: time per MD step on a serial host implementation; note the different scale of the y-axis compared to panels 
a and d; panel c: relative GPU performance versus a single CPU core; panel d: time per MD step using the parallelised LAMMPS package for jobs of various 
process numbers N p . For each density and particle number, the fastest result is shown, where symbols indicate the selected N p . 
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the wall time spent on the MD step was measured for the next 
10 4 steps. For the GPU and parallel CPU benchmarks, the re- 
sults were averaged over 10 independent realisations. For the 
LAMMPS benchmarks, only a single system was generated and 
the results were averaged over two consecutive measurement 
runs of 10 4 steps each. The investigated system sizes ranged 
from N = 1,372 to 864,000 particles and the densities from 
q* = 0.2 to 1.2. 

The GPU benchmarks were run on GPUs of type NVIDIA 
GeForce GTX 280, which contains 240 scalar processor cores 
clocked at 1.3 GHz, with GDDR3 memory at 1.1 GHz provid- 
ing a device memory bandwidth of 140 GB/s (=2x1.1 GHz x 
5 12 bits). Its theoretical peak single precision floating-point 
performance based on one fused multiply-add operation and 
one concurrent multiply operation per cycle is 933 Gflop/s (= 
240 x 3 flop x 1296 MHz). For comparison with our host ref- 
erence implementation, we used an AMD Opteron 2216 HE 
processor at 2.4 GHz with PC2-4200 CL4 dual-channel mem- 
ory and a theoretical memory bandwidth of 17 GB/s (= 2 x 
2 X 533 MHz x 64 bits). The theoretical peak performance on 
the assumption of a single SSE instruction execution with four 
concurrent single precision floating-point operations per cycle 
is 9.6 Gflop/s per CPU core. This roughly yields a factor of 100 
for the theoretical limit on the speedup of a compute-bound al- 
gorithm, and a factor of 8 for an algorithm limited by memory 
bandwidth. The estimate does not take into account the mem- 
ory latency on the GPU, which is 400-600 clock cycles for a 
thread accessing global memory [29 1, and may more substan- 
tially constrain the overall performance of an algorithm than 
floating-point performance or memory bandwidth. Further, re- 
cent CPUs contain several cores, and typical nodes in a comput- 
ing centre are found to be 4- to 16- way SMP machines, and one 
has to put these numbers into perspective for a parallelised im- 
plementation making use of all available cores simultaneously. 

The GPU benchmarks were performed with HAL's MD pack- 
age [30, commit e6 11734] using single floating-point preci- 
sion on the GPU for comparison with other work, i.e., without 
the implementation of double-single precision described above. 
The GPU-specific CUDA code was compiled with NVIDIAs 
CUDA compiler (version 2.2) targeting compute capability 1.0. 
The host-specific code was compiled for the x86_64 architec- 
ture with the GNU C/C++ compiler (version 4.3.4) with opti- 
misation flag 03. The serial CPU benchmarks were done with 
HAL's MD package ||30l commit a628797]; it was compiled 
with single floating-point precision for x86_64 using the GNU 
compiler again with optimisation flag 03, which implies auto- 
matic vectorisation of loops. On the x86_64 architecture, GCC 
defaults to SSE floating-point arithmetic, which ensures that 
throughout a calculation, a floating-point value is stored with 
the precision mandated by its data type. 

The compute times per MD step obtained on the GPU and 
the host are compared in Figs. [3^ and b, respectively. They 
are proportional to the number of particles N, and the double- 
logarithmic representation nicely corroborates the linear com- 
plexity of the simulation algorithm. In the GPU case, the lin- 
ear scaling is only obeyed for sufficiently large N > 20,000; 
for smaller N, the compute times show a constant offset of 



0.2 - 0.5 ms, reflecting the overhead of parallel sorting and re- 
duction. The prefactor increases for denser systems, which we 
attribute to larger neighbour lists. At p* = 0.4, we measure a 
GPU time per MD step per particle of 24 ns; this value dou- 
bles to 53 ns for the highest density, p* = 1.2. For compari- 
son, the timings reported in Refs.[T]and|2]at p* = 0.4 are 62 ns 
and 200 ns per MD step and particle using the older NVIDIA 
GeForce 8800 GTX hardware with only 128 CUDA cores at 
1.5 GHz. 

For the described double-single precision implementation, 
the execution times on the GPU increase modestly by about 
20%. The dependence on system size and density is similar to 
the results for the single precision implementation in Fig. [3^. 
Thus, the performance penalty for using single-double preci- 
sion in critical parts of the algorithm results roughly in a global 
factor. A significant part of the additional execution time seems 
to be related to memory latency. Using double-single precision, 
twice the amount of data is read and written for velocity and po- 
sition vectors. For the three-dimensional velocity vectors, this 
totals to 3 x 2 = 6 32-bit words per particle, requiring at least 
two coalesced memory operations by using, e.g., two arrays of 
float4. In contrast, the velocity vector can be accessed with a 
single coalesced operation for single precision. We expect that 
these considerations hold also if native double precision on the 
GPU is used, dropping the additional floating-point operations 
for the double-single arithmetic. CUDA devices of compute 
capability 2.0 only support texture read operations of up to 128 
bytes, which would entail splitting position and velocity vec- 
tors into properly aligned double2 and double components, and 
thus require twice the amount of memory accesses per vector. 

On the host processor, the MD step time of the serially ex- 
ecuted programme is proportional to N for small system sizes 
too (Fig. [3|3). With increasing system size, the execution times 
increase slightly, probably due to a growing number of cache 
misses. In particular, performance degraded substantially (by 
a factor of 2) for systems of more than 10 5 particles if the 
particles were randomly distributed in memory. Thereby, the 
obtained speedups were spoiled considerably, which we have 
solved by regularly sorting the particles in memory as in the 
GPU implementation. The prefactor spreads by about 60% 
around its average, which is somewhat larger than on the GPU 
(40%). At p* = 0.4, we have measured an MD step time per 
particle of 1.8 [is, which is comparable to the timings obtained 
below with the LAMMPS package on a single processor core. 

The relative performance gain of the GPU over the CPU is 
displayed in Fig. [3};. It depends to some extend on the parti- 
cle density such that dense systems are favoured by the GPU. 
While the speedup factor is as small as 4 to 12 for small systems 
of just 10 3 particles, it goes up to values around 40 for 10 4 par- 
ticles, and it reaches an approximate plateau between 70 and 
80 for more than 10 5 particles, being close to the theoretical 
limit for a compute-bound algorithm. The LAMMPS bench- 
marks were performed at the Leibnizrechenzentrum in Munich 
on 8-way nodes containing 4 dual-core processors of type AMD 
Opteron 8218 HE 2.6 GHz. The tested version of the package 
was released on 9 January 2009, it was compiled with Intel's 
C++ compiler (version 10.1) using the optimisation flags 02, 
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ipo, unroll, and fstrict-aliasing. The programme was run in par- 
allel mode using the MPI interface, the nodes were connected 
via 10 Gigabit Ethernet, and the MPI library used was from 
Parastation. We have tested configurations with 8, 16, 24, 32, 
48, and 64 processes. For each particle number and density, the 
fastest configuration was selected, and these smallest MD step 
times are shown in Fig.[3]l; only configurations with 1, 2, 6, and 
8 nodes are relevant for particle numbers between 10 3 and 10 6 . 
For system sizes 10 3 < N < 10 4 , a single node (8 processes) 
is favourable. Although the parallelisation overhead is signif- 
icantly smaller than for the GPU, the measured times per MD 
step exceed those on the GPU for more than 4,000 particles. 
Restricting the configuration to a single 8-way node even for 
larger systems, the execution times for 10 s particles are found 
to be 5 to 10 times slower than for the GPU, depending on den- 
sity; for higher particle numbers, the single-node performance 
goes drastically down, probably due to frequent cache misses. 
For a large system of 10 4 < N < 10 5 particles, the fastest 
execution is obtained with 48 processes, but the time per MD 
step and particle is about 2 times slower compared to the GPU, 
in particular at high densities. For huge systems, N > 10 s , 
the absolute numbers and the spread of the execution times are 
comparable for the GPU and 64 processes. In conclusion, a sin- 
gle GPU outperforms a conventional cluster by a factor of 2 for 
large systems and performs similarly as 8 recent 8-way nodes 
for huge system sizes. 

4. Numerical long-time stability 

We have thoroughly tested the numerical long-time stabil- 
ity of our implementation with respect to momentum and en- 
ergy conservation. In Section [3] we will show how a drift in 
these quantities may affect the dynamics of the simulated sys- 
tem. Numerical errors will be larger if only single precision 
arithmetic is used, but the pace at which numerical errors ac- 
cumulate also depends on the form of the potential. We will 
compare the Lennard-Jones potential and its purely repulsive 
part, also known as Weeks-Chandler-Andersen (WCA) poten- 
tial [3T), Vwca('"; e, c) = Vjj(r; e, <f) + s which is cutoff at 
r c - 2 1 / ( V. This potential allows for smaller neighbour lists 
with less particles, speeding up the simulation by up to 40%. 
Further, we will discuss smoothed versions of the potentials, 
which possess a continuous second derivative by multiplying 
with the function g(x = (r - r c )/hcr) = x 4 /(l + jc 4 ). The fol- 
lowing scrutiny is based on a system of N - 10,000 particles 
at p* = 0.75 and T* = 1.12. For either the LJ or the WCA 
potential, all results share the same initial configuration, which 
was obtained by equilibrating an initial fee lattice in the NVT 
ensemble with /i* = 1, Sf = 0.001 and h = 0.005 over 10 s steps 
using the GPU (double-single precision) implementation. 

4.1. Momentum conservation 

Momentum conservation implies that the interaction between 
particles does not generate additional momentum or, equiva- 
lently, a drift of the centre-of-mass (cm.) velocity or mean 
particle velocity, v cm = (vi) N . For the summation of forces, 



a standard index loop over the neighbouring cells yields first 
a large force contribution to one direction, which is later can- 
celled by the forces from particles on the opposite side. Such 
an implementation in single precision results in a clear drift of 
the cm. velocity for the WCA potential as illustrated in Fig.|4^; 
the drift is an order of magnitude smaller in the case of the LJ 
potential, Fig. [4}i. Using double-single precision for the sum- 
mation of forces and summing opposite cells together reduces 
the drift by factors of 20 and 2.7 for the WCA and LJ poten- 
tials, respectively (panels b and e). The drift of the cm. ve- 
locity is suppressed by another factor of 100(!) in both cases 
if the velocity- Verlet integration is done in double-single pre- 
cision too (panels c and f). Modifying the time step of the in- 
tegration or smoothing the potential does not have a significant 
effect on the velocity drift. 

4.2. Energy conservation 

Faithful simulations in the microcanonical ensemble cru- 
cially depend on the conservation of the total system energy, 
E = Ejjjj, + £p 0t = const. This condition is particularly sensi- 
tive to limited floating-point precision and requires particularly 
careful examination. For example, an energy drift of 2% was 
found after 10 6 MD steps using the single precision version of 
GROMACS (which is the default) |32"1 . 

For our GPU and host implementations using both single 
and double precision, we have examined the evolution of the to- 
tal system energy, displayed in Fig. [5] In addition, we compare 
the degree of energy conservation for WCA and LJ potentials 
with and without smoothing and for two different time steps. 
The single precision GPU implementation shows a clear drift 
of about 0.1% per 10 7 MD steps for both potentials (panels a 
and b). There is a contribution from the cm. velocity drift, 
which is, however, orders of magnitude smaller. Smoothing 
the potentials at the cutoff does not improve energy conserva- 
tion here. For the WCA potential, the time step St* - 0.003 is 
obviously too large; energy conservation is considerably more 
degraded if the potential is smoothed. Using double-single pre- 
cision for the summation of forces and in the velocity- Verlet 
algorithm (panels d and e) reduces the energy drift by factors 
of about 3 (LJ potential) to 7 (WCA). Additional smoothing of 
the potentials, however, improves energy conservation signif- 
icantly by an extra order of magnitude. Thereby, the drift is 
almost eliminated in the case of 6t* = 0.001 down to 5xl0~ 6 
and 6xl0~ 5 per 10 7 steps for the WCA and the LJ potential, 
respectively. Such a tiny drift, however, is dominated and ob- 
scured by numerical fluctuations. 

The host simulation results shown in panels c and f under- 
mine that the degradation of energy conservation due to single 
precision is not specific to the GPU implementation. While the 
quantitative evolution of the energy in panels e and f differs due 
to the implementation of floating-point arithmetic in either 44- 
bit double-single precision (GPU) or native 53-bit double pre- 
cision (CPU), the orders of magnitude of energy conservation 
are comparable. 
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Figure 4: Drift of the centre-of-mass velocity for simulation runs on the GPU if forces are accumulated in single precision and naive order (panels a,d), or in 
double-single precision and opposite cell order (b,e), and if double-single precision is used for both summation of forces and the velocity- Verlet integration (c,f). 
The two rows compare the purely repulsive WCA potential (a-c) and the Lennard-Jones potential (d-f). Each panel contains data for two different time steps 
(St* = 0.003 and 0.001) with and without smoothing the potentials (h = 0.005 and 0). 




Figure 5 : Conservation of total energy per particle during a long simulation run. The first two columns refer to GPU results (panels a,b,d,e) and the last column to 
the host implementation (c,f). Single floating-point precision was used for the top row and double-single or double precision for the bottom row. The sensitivity 
on the precision is compared for the WCA and LJ potentials with and without smoothing and for two different time steps as indicated. Note the different y-axis of 
panel a. 
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Figure 6: Scaling test for the numerical fluctuations of the total energy around the initial value on the GPU (left and centre column) and the host (right column). 
Results using single precision are shown in the top row, double-single (or native double) precision was used in the bottom row. A smoothed LJ potential was used 
in the centre and right columns (h = 0.01). 



4.3. Numerical energy fluctuations 

The analysis of the numerical energy fluctuations yields a 
sensitive test of the numerical accuracy which requires only 
short simulation runs. Newton's differential equations are dis- 
cretised by the velocity-Verlet algorithm to first order in the 
time step 5t, introducing a discretisation error of order 6t z — 
provided the potential possesses a continuous second deriva- 
tive. Thus, one expects that the total energy shows numerical 
fluctuations around its initial value which scale as 6t 2 . In par- 
ticular, [E(t) - E(Q)]/6t 2 should roughly be independent of the 
time step. 

For time steps between 5t* = 0.001 and 0.005, these rescaled 
energy fluctuations are displayed in Fig. [6] for the LJ poten- 
tial. Using double-single precision on the GPU, all five curves 
nicely collapse (panel d) over the range of 200 to 1000 integra- 
tion steps. Merely the largest time step, 5f = 0.005, is slightly 
off, indicating that higher order terms become relevant in this 
case. Smoothing the potential at the cutoff improves the col- 
lapse, from which small time steps benefit especially (panel e). 
If only single precision arithmetic is used on the GPU, the col- 
lapse is poor and smoothing the potential at the cutoff has essen- 
tially no effect (panels a and b). The two smallest time steps are 
completely off, which we attribute to the quick accumulation of 
rounding errors from the tiny increments. 

For comparison, we have added the results from the host 
implementation in single and double precision (panels c and f). 
The almost identical behaviour as on the GPU corroborates that 
the discussed effects are due to the limited precision only, and 
no other numerical artifacts are introduced by the GPU. An in- 



significant, but nevertheless interesting difference between the 
GPU (double-single precision) and the CPU (double precision) 
results are the high-frequency fluctuations visible at the small- 
est time step 6f = 0.001 for the GPU case (panels d and e). 
These tiny fluctuations are caused by the evaluation of the in- 
dividual pair force contributions in single precision. We veri- 
fied that the high-frequency fluctuations disappear if the entire 
force algorithm on the GPU is implemented in double-single 
precision at the cost of significantly reduced performance. 

5. Application to glassy dynamics of a supercooled liquid 

We employed the above GPU implementation of a molec- 
ular dynamics simulation to reproduce the slow dynamics of a 
supercooled liquid of soft spheres. We resorted to the Kob- 
Andersen (KA) binary mixture [33-35], which has proven to 
be a useful model system that reliably delays crystallisation^] 
Specifically, we have investigated a relatively large system of 
40,000 A and 10,000 B particles of equal masses m interacting 
with Lennard-Jones potentials, V a p(r) = Vu(>~, SaB, o~ub) with 
the parameters chosen as in Ref. 33: ctaa - 1> o~bb = 0.88, 
cr AB = 0.8, Baa = T Sbb = 0.5, and e AB = 1.5. The time step 
of the Verlet integrator was taken to be 6t* = 0.001 in units of 

■Jtno^A I saa ■ Equilibration runs were done for fixed energy and 
covered about 10 times the structural relaxation time; all results 



We could not find any signs of crystallisation for a very long thermostat 
run of a small system of 1,500 particles at 7™ = 0.4 and p* = 1.2 over a time 
span of 10 8 LJ units or 5x 10 9 MD steps. 
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Figure 7: Simulation results for a Kob-Andersen mixture obtained on a GPU using double-single arithmetic. Mean-square displacements (a) and incoherent 
intermediate scattering functions (b) of the A particles are shown for different temperatures approaching the glass transition at fixed density p* = 1.2. Broken lines 
indicate results with single floating-point precision. 



were obtained for two independent systems with different ini- 
tial conditions. Production runs of lxlO 6 to 2xl0 8 steps (for 
T* = 0.60 and 0.43, respectively) including the online evalua- 
tion of the correlation functions merely took between 1 .4 hours 
and 9.6 days of real time. 

For the following discussion, we need some basic quanti- 
ties central to the description of glassy dynamics. The simplest 
quantity to characterise the transport dynamics is the mean- 
square displacement (MSD) 6r 2 (t) = ((r,(f) - r,-(0)))^, where 
(•) denotes a microcanonical time average and {-) N an average 
over all particles, i = l,..,,N. Here, we restrict the discussion 
to A particles. The diffusion constant D of a tagged particle is 
then obtained via 6r 2 (t) a 6Dt for t — > oo. A more complex 
quantity is the time-correlation function of local density fluc- 
tuations, the intermediate scattering function (ISF). The self- 
part of the ISF is defined as F s (q, = ((p, \t)P-{(fl))) using 
the Fourier components of the tagged particle density, p q (t) = 
exp(-iq-r (s) (f)), with discrete wavenumbers q e (7.njL)l? and 
r (s) denoting any of the r,. Rotational symmetry implies that 
F s (q, f) = F s (q, f) merely depends on the magnitude of the 
wavevector q — |q|, and additional averaging can be done over 
the orientation of q. To quantify the structural relaxation time 
T a , we adopt the usual definition F s (q max ,T a ) = 1/e, where 
<7max = 7.25cr _1 denotes the maximum of the structure factor. 

The MSD develops a pronounced plateau for decreasing 
temperature, reflecting the caging by the arrested surroundings; 
see Fig. [7^. The diffusion coefficient is drastically suppressed 
as the temperature approaches an anticipated glass transition 
temperature T g and spans more than 3 decades over the simu- 
lated temperature range. Correspondingly, the density correla- 
tors shown in Fig. [Tj? develop a plateau, which is a signature 
of the structural arrest. The slow dynamics is quantified by the 
diverging structural relaxation time T a , for which we observe 
an increase by 4 orders of magnitude. 



T' 






D* , 

single 


error 


T* . 
a, d.-s. 


T* ■ , 
a; single 


error 


0.48 


1.02x10 


-4 


1.03xl0~ 4 


1% 


2.31xl0 3 


2.26xl0 3 


2% 


0.46 


4.48x10" 


-5 


5.12xl0- 5 


14% 


6.43 xlO 3 


5.69xl0 3 


12% 


0.44 


1.73x10- 


-5 


2.43 xl0~ 5 


40% 


2.19xl0 4 


1.71xl0 4 


22% 


0.43 


1.02x10" 


-5 


oo 




4.24xl0 4 


2.50xl0 4 


41% 



Table 1: Dimensionless diffusion constants D* and relaxation times r* of A 
particles for temperatures close to the glass transition. The table compares sim- 
ulation results obtained with double-single and single floating-point precision 
and gives the relative error due to the limited precision. 



A clear observation of the glassy dynamics requires suffi- 
cient separation between short-time features and the slow struc- 
tural relaxation. Hence, the temperature has to be fine-tuned 
close to the transition, presupposing a sharp value for T g . More- 
over, simulation runs become very long and the energy of the 
system must be extremely stable during a complete run; we 
could limit the energy drift to merely 3xl0~ 5 over 2xl0 8 MD 
steps at T* = 0.43. It turns out that such a long-time stabil- 
ity cannot be maintained with single floating-point precision. 
For low temperatures T* = 0.43,0.44,0.46,0.48, we have per- 
formed additional production runs with single precision, and 
both the MSD and the density correlators deviate significantly 
at long times, see Fig. [7] (broken lines). The system heats up 
during the simulation, which introduces physical artifacts in the 
form of a faster relaxation at the pretended temperature. The 
determined diffusion constants and structural relaxation times 
for both levels of precision are compared in Table [T] revealing 
quantitative differences of up to 41 %. At the lowest investigated 
temperature, the system appears to become even super-diffusive 
at long times; note the crossing of the MSD curve at T* = 0.43 
obtained with single precision and the one at T* = 0.44 using 
double-single precision in Fig. [7^. 

Further, the use of current graphics processors facilitates 
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the investigation of much larger system sizes than were usu- 
ally accessible before. Among other benefits, data quality is 
enhanced as statistical fluctuations of tagged particle quantities 
are expected to scale as 1 / V^V- First results for the velocity au- 
tocorrelation function display an excellent signal-to-noise ratio 
and shed light on novel power-law correlations at low tempera- 
tures Il36t 

6. Conclusion 

We have shown how recent graphics processing units can be 
harnessed to carry out large-scale molecular dynamics simula- 
tions in the microcanonical ensemble with strict energy con- 
servation even after 10 8 MD steps. Using GPU computing, 
we were able to reproduce the slow glassy dynamics of a bi- 
nary Lennard-Jones mixture over 4 nontrivial decades in time. 
Single floating-point precision, however, is not sufficient for 
this purpose and may result in qualitatively and quantitatively 
wrong results; e.g., the diffusion coefficient was found to di- 
verge at T* = 0.43 and to deviate by up to 40% at higher 
temperatures due to the limited energy conservation. We have 
shown that the mediocre native double precision performance 
of recent GPUs can be overcome by implementing numerically 
critical parts of the MD algorithm with double-single precision 
floating-point arithmetic, which is based on single precision in- 
structions. 

The described MD simulation package is fully implemented 
on the GPU using CUDA, and it avoids costly memory transfers 
of trajectory data between host and GPU. In addition to earlier 
work IT] [21 , the sorting of particles in memory and the evalua- 
tion of dynamic correlation functions are completely performed 
on the GPU. The number of simulated particles is solely lim- 
ited by the amount of global device memory. On the NVIDIA 
GeForce GTX 280 providing 1 GB of memory, simulations of 
864,000 particles are possible, but the barrier of one million 
particles is broken on the NVIDIA Tesla C1060 with 4 GB of 
(somewhat slower) memory. Our performance measurements 
show speedups of 70 to 80 compared to a serial simulation on 
the host processor, and we have shown that the GPUs we have 
deployed perform similarly to LAMMPS on a modern, conven- 
tional HPC system running in parallel on 64 processor cores. 
We have found that the use of double-single precision in spe- 
cific parts of the algorithm increases the execution times by 
merely 20%, which we attribute mainly to a doubling of mem- 
ory accesses. While the use of native double precision arith- 
metic in the next generation of GPUs will reduce the number 
of floating-point operations compared to double-single preci- 
sion, we expect that the performance penalty due to the latency 
of global memory access will remain. In particular, the trade- 
off between performance optimisation and numerical accuracy 
in terms of floating-point precision will persist for GPUs as it 
does for conventional processors. 

In summary, current graphics processors provide a powerful 
and robust means for state-of-the-art simulations of simple and 
complex liquids in general and for numerical studies on glass- 
forming liquids in particular. Substantial computing resources 
can be delivered already by local GPU clusters containing a 



few dozen high-end GPUs, which are affordable in terms of ac- 
quisition cost and maintenance for a single institute and which 
are likely to play a considerable role in future simulation-based 
research. In addition, some national computing centres have 
started to support GPU-accelerated computing on large dedi- 
cated GPU clusters, which are useful at the single-GPU level 
already. They will, however, be fully exploited only by the fur- 
ther development of simulation packages running on distributed 
GPUs (see Refs. 6-8 , 10 for examples), enabling the routine in- 
vestigation of large and complex systems that can be studied 
today on exceptionally few supercomputers only. 
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